
Executive Summary

Our project team is VARS Consulting. NBCUniversal has contracted VARS Consulting to analyze box office performance and make strategic recommendations that help the company assess the likely financial performance of its upcoming theatrical releases. As part of this consulting engagement, we analyzed box office data for both NBCUniversal and other production companies and used it to derive key insights. In this report, we detail how we gathered and prepared the necessary data and how we satisfied the client’s request using a multiple linear regression model. Finally, we use the selected model to predict likely box office outcomes.

Summary of Findings

something 


VARS Consulting: Data Science Process

VARS Consulting utilized the following process for this engagement:
something 


 

Data Sourcing

We gathered data from two primary sources:
- Internet Movie Database (IMDB): contains information about movies and television shows, such as cast, budgets, plots, and reviews.
imdbid title plot rating imdb_rating metacritic dvd_release production actors imdb_votes poster director release_date runtime genre awards keywords Budget Box.Office.Gross
tt0010323 The Cabinet of Dr. Caligari Hypnotist Dr. Caligari uses a somnambulist, Cesare, to commit murders. UNRATED 8.1 N/A 15-Oct-97 Rialto Pictures Werner Krauss, Conrad Veidt, Friedrich Feher, Lil Dagover 42,583 https://images-na.ssl-images-amazon.com/images/M/MV5BMTY1NzIxOTcxM15BMl5BanBnXkFtZTgwMjY0ODgwNzE@._V1_SX300.jpg Robert Wiene 19-Mar-21 67 min Fantasy, Horror, Mystery 1 nomination. expressionism|somnambulist|avant-garde|hypnosis|fair|visit|murder|asylum|violence|opening-a-door|death|flashback-within-a-flashback|costume-horror|flashback|gothic|surrealism|enigma|kidnapping|good-versus-evil|sleepwalking|mind-control|macabre|carnival|mannequin|insanity|evil-doctor|diabolical|madman|megalomania|megalomaniac|tragic-villain|mad-scientist|sideshow|hypnotism|psychopath|somnambulism|psychiatrist|surprise-ending $18,000 0
tt0052893 Hiroshima Mon Amour A French actress filming an anti-war film in Hiroshima has an affair with a married Japanese architect as they share their differing perspectives on war. NOT RATED 8 N/A 24-Jun-03 Rialto Pictures Emmanuelle Riva, Eiji Okada, Stella Dassas, Pierre Barbaud 21,154 https://images-na.ssl-images-amazon.com/images/M/MV5BMjMyNDYzMzU5OV5BMl5BanBnXkFtZTgwNTUxNzU4MjE@._V1_SX300.jpg Alain Resnais 16-May-60 90 min Drama, Romance Nominated for 1 Oscar. Another 6 wins & 5 nominations. memory|atomic-bomb|lovers-separation|impossible-love|nuclear-bomb|radiation-victim|nuclear-radiation|hiroshima-japan|post-war|nuclear-weapons|peace|german-soldier|actress|japanese-man|anti-war|first-love|oblivion|japan|nouvelle-vague|humanism|death-of-lover|separation|sleepless-night|voice-over-inner-thoughts|locked-in-a-cellar|tragic-love|traumatic-experience|20th-birthday|death-of-loved-one|death-of-boyfriend|adulterer|adulteress|adulterous-desire|memorial-park|war-memorial|peace-demonstration|film-in-film|nevers-france|two-in-a-shower|survival|archive-footage|museum|radiation-burn|radiation-poisoning|1950s|1940s|city-in-title|three-word-title|french-new-wave|obliviousness|first-person-title|cult-film|place-name-in-title|claim-in-title|part-documentary|asian-man-white-woman-relationship|world-war-two|nonlinear-timeline|jump-cut|flashback|extramarital-affair|surrealism|franco-japanese|hotel|hotel-room|frenchwoman|interracial-love|interracial-couple|interracial-romance|france $88,300 0
tt0058898 Alphaville A U.S. secret agent is sent to the distant space city of Alphaville where he must find a missing person and free the city from its tyrannical ruler. NOT RATED 7.2 N/A 20-Oct-98 Rialto Pictures Eddie Constantine, Anna Karina, Akim Tamiroff 17,801 https://images-na.ssl-images-amazon.com/images/M/MV5BMzk2MTlkM2EtNzNhYi00Y2YxLWIwODktNGQ0NDM2ZTgwODJiXkEyXkFqcGdeQXVyNjI5NTk0MzE@._V1_SX300.jpg Jean-Luc Godard 5-May-65 99 min Drama, Mystery, Sci-Fi 1 win. dystopia|french-new-wave|satire|comic-violence|surrealism|nouvelle-vague|avant-garde|neo-noir|secret-agent|lemmy-caution|future|dictionary|conscience|computer|alternate-reality|hard-boiled|spying|spy-hero|spy|reference-to-james-bond|eurospy|espionage|french-science-fiction|car-chase|social-satire|ford-mustang|violence|swimming-pool|spoof|sexism|science-runs-amok|riddle|neon|negative-footage|mind-control|mathematical-equation|gun-violence|galaxy|forbidden-speech|evil-computer|public-execution|totalitarianism|philosophy|bible|nudity|utopia-quest|dictator|detective|metropolis|artificial-intelligence|spiral-staircase|based-on-novel|character-name-in-title $220,000 $46,585
tt0074252 Ugly, Dirty and Bad Four generations of a family live crowded together in a cardboard shantytown shack in the squalor of inner-city Rome. They plan to murder each other with poisoned dinners, arson, etc. The … N/A 7.9 N/A 1-Nov-16 Compagnia Cinematografica Champion Nino Manfredi, Maria Luisa Santella, Francesco Anniballi, Maria Bosco 5,705 https://images-na.ssl-images-amazon.com/images/M/MV5BMTEwMzkwMDgxNTdeQTJeQWpwZ15BbWU4MDc3MzM1MzAy._V1_SX300.jpg Ettore Scola 23-Sep-76 115 min Comedy, Drama 1 win & 2 nominations. incest|failed-murder-attempt|poisoned-food|baptism|planning-a-murder|woman-in-a-wheelchair|cantankerous-old-woman|tv-reporter|domestic-violence|woman-with-mustache|brother-in-law-sister-in-law-sex|avarice|cupidity|dysfunctional-family|burnt-face|sexual-promiscuity|money-roll|misery|family-patriarch|traveling-salesman|promiscuity|drunkenness|greed|scooter|nude-model|nude-photograph|large-family|360-degree-pan|vomiting|sex|lumpenproletariat|consumerism|dream|black-comedy|sea|pregnancy|extramarital-affair|prostitute|male-prostitute|crossdresser|poisoning|long-take|rome-italy|family-relationships|commedia-all’italiana|woman-on-top|slum|italy|1970s|poison|burlesque|poverty $6,590 0
tt0084269 Losing Ground A comedy-drama about a Black American female philosophy professor and her insensitive, philandering, and flamboyant artist husband who are having a marital crisis. When the wife goes off on… N/A 6.3 N/A N/A Milestone Film & Video Billie Allen, Gary Bolling, Clarence Branch Jr., Joe Garcia 132 https://images-na.ssl-images-amazon.com/images/M/MV5BMTUwMzQzNDg0MV5BMl5BanBnXkFtZTgwMDgwMDUwODE@._V1_SX300.jpg Kathleen Collins 1-Jun-82 86 min Comedy, Drama N/A artist|painter|marriage|black-independent-film|independent-film|professor|f-rated|written-by-director|title-directed-by-female|female-director|swimming-pool|swimming|filming|kissing|painting|two-word-title|black-middle-class|middle-class|middle-age-couple|love-triangle|philosopher 0 0
tt0085180 L’argent A forged 500-franc note is cynically passed from person to person and shop to shop, until it falls into the hands of a genuine innocent who doesn’t see it for what it is - which will have … N/A 7.5 95 24-May-05 Criterion Collection Christian Patey, Vincent Risterucci, Caroline Lang, Sylvie Van den Elsen 5,607 https://images-na.ssl-images-amazon.com/images/M/MV5BY2RlYTc2ZGUtMGFlNS00ZGUxLWEzODYtYjJhY2RmOWRkYzY4L2ltYWdlXkEyXkFqcGdeQXVyNDQzMDg4Nzk@._V1_SX300.jpg Robert Bresson 18-May-83 85 min Crime, Drama 2 wins & 3 nominations. note|murder|solitary-confinement|robbery|delivery-man|camera|bank-robbery|money|pushing|table|objectified-woman|unreliable-employee|woman-in-bed|old-woman-murdered|murdered-in-a-bed|pretending-to-take-medicine|wristwatch|working-class|wine|whiskey-bottle|wheelchair|wheelbarrow|wealth|washing-clothes|waiter|vengeance|valium|trial|toy-store|thief|theft|telephone-call|teenage-boy|teacher|suitcase|subway|stranger|sticking-out-one’s-tongue|stealing|stakeout|sorting-mail|sleeping|sleeping-in-a-barn|sister-sister-relationship|sink|sidewalk-cafe|shopkeeper|shop-window|serving-ladle|searching|scrapbook|school|schoolboy|running|returned-mail|return-to-sender|restaurant|release-from-prison|redemption|reckless-driving|pursuit|purse|punishment|prisoner|prison-visitation|prison-guard|prison-discharge|prison-cell|prison-break|prison-alarm|priest|prank|policeman|police-van|police-station|police-car|pitchfork|pill|picture-frame|piano|piano-teacher|piano-player|photographer|perjury|passing-note|pajamas|pacing|newspaper|murderer|murder-of-family|mother-son-relationship|mother-daughter-relationship|moped|metro|mass|mass-murder|map|mail|magazine|loan|lie|liar|letter|letter-censorship|lawyer|lantern|knife|knees|key|judge|ironing|invoice|investigator|investigation|imprisonment|husband-wife-relationship|hunger|humiliation|hotel|hospital|helmet|heart-monitor|headmaster|handcuffs|gun|gunshot|guilt|greed|getaway-car|gas|gas-delivery-man|garden|friend|f
riendship|fraud|france|forgiveness|forgery|footbridge|food|following|floor-polisher|fleeing|father-son-relationship|father-daughter-relationship|family-relationships|pretending-to-take-a-pill|face-slap|escape|fired-from-the-job|elevator|drink|drinking|drain|dog|dining-hall|digging-for-potatoes|death|death-of-husband|death-of-daughter|darkroom|custody|cross|court|courtroom|corruption|confession|coffee|clothes-line|class|classroom|cigarette-smoking|check|chase|charity|cell-mate|catholic|catholic-church|cash-register|car-accident|camera-store|cafe|burglary|burglar-alarm|broken-glass|broken-dish|breaking-and-entering|bread|blood|betrayal|beating|bakery|axe-murder|arrest|ambulance|alarm|accusation|accomplice|suicide-attempt|sleeping-pills|photo-shop|multiple-murder|master-key|foreign-language-adaptation|false-testimony|chain-reaction|based-on-short-story|scam|ex-convict|counterfeit|police|axe|prison|death-of-child|female-nudity 0 0
The description of these fields is as follows:
variable description
imdbid Unique ID used by IMDB to refer to the movie.
title Title of the movie
plot Movie plot summary
rating MPAA appropriate audience rating
imdb_rating IMDB voters’ score for a movie on a scale of 1-10 (10 being best)
metacritic Metacritic movie score on a scale of 0-100 (100 being best)
dvd_release Movie release date on DVD
production Principal production company
actors Lead actors
imdb_votes Total votes from IMDB members.
poster Movie poster artwork
director Movie director
release_date Theatrical release date
runtime Runtime length of movie in minutes
genre Genre classification
awards Academy awards & nominations
keywords Keywords associated with the movie
Budget Budget spent on the movie production, marketing, and distribution.
Box.Office.Gross Box office gross returns as of 9/21/2017
- BoxOfficeMojo.com: a box office reporting website that also has budget data for movies. From this source, we used seasonal box office information.
Season.Start Season.End Box.Office.Season Season.Gross Season.YoY Season.Days Season.Daily.Avg Season.Movie.Count Season.Move.Avg.Gross
1/5/07 3/1/07 Winter 890.1 -0.024 58 15.3 70 12.7
3/2/07 5/3/07 Spring 1342.3 -0.002 62 21.7 116 11.6
5/4/07 9/3/07 Summer 4210.5 0.128 122 34.5 218 19.3
9/4/07 11/1/07 Fall 947.9 -0.138 58 16.3 120 7.9
11/2/07 1/4/08 Holiday 2299.9 0.082 60 38.3 107 21.5
1/5/08 3/6/08 Winter 1052.7 0.183 64 16.4 99 10.6

Data Combining

We created Excel formulas to identify the box office season that each release date belonged to, which let us combine both sources into a single master dataset. A sample of this master dataset is shown below:
imdbid title plot rating imdb_rating metacritic dvd_release production actors imdb_votes poster director release_date runtime genre awards Budget Box.Office.Gross Box.Office.Season Season.Gross Season.YoY.Change Season.Days Season.Daily.Avg Season.Movie.Count Season.Movie.Avg keywords
tt0200465 The Bank Job Martine offers Terry a lead on a foolproof bank hit on London’s Baker Street. She targets a roomful of safe deposit boxes worth millions in cash and jewelry. But Terry and his crew don’t realize the boxes also contain a treasure trove of dirty secrets - secrets that will thrust them into a deadly web of corruption and illicit scandal. R 7.3 69 15-Jul-08 Lionsgate Jason Statham, Saffron Burrows, Stephen Campbell Moore, Daniel Mays 158,562 https://images-na.ssl-images-amazon.com/images/M/MV5BMTUwMzc1MDMxOV5BMl5BanBnXkFtZTcwODY4OTIzMw@@._V1_SX300.jpg Roger Donaldson 7-Mar-08 111 min Crime, Drama, Romance 3 nominations. 20000000 30028592 Spring 1074900000 -0.20 55 19543636 111 9700000 safe-deposit|heist|chase|mobster|london-england|bank|pornographer|blackmail|secret-service|bank-vault|crooked-policeman|heist-movie|bank-heist|walkie-talkie|torture|robbery|tunnel|murder|car-dealer|brothel|airport|train-station|revolver|weapon|ford-transit|ford|van|car-salesman|based-on-true-events|woman|20th-century|england|united-kingdom|champagne|red-wine|beer|female-frontal-nudity|gunfight|shootout|fistfight|sex-scene|kissing-while-having-sex|kiss|neo-noir|violence|incriminating-photograph|s&m|casual-sex|black-activist|paparazzi|year-1971|period-piece|dutch-angle|debt|peace-sign|masochism|doublecross|extortion|recruiting|planning|stripper|caper|mafia|suffocation|strangulation|stabbed-in-the-back|pistol|double-cross|death|window-smashing|what-happened-to-epilogue|wedding-reception|vulgarity|trinidad|tailor|subway|strip-club|stabbing|shot-in-the-head|rooftop|revolutionary|railway-station|pub|prologue|princess|pornographic-film|political-corruption|police-corruption|photograph|parking-garage|open-grave|nonlinear-timeline|menage-a-trois|marriage|machete|jackhammer|infidelity|hidden-camera|ham-radio|fishing-boat|female-nudity|fashion-model|drug-smuggling|customs|courtroom|caribbean|cabinet-officer|brick|bracelet|book-party|beach|basement|assault|ambulance|1970s|domina
trix|death-of-friend|based-on-true-story
tt0315642 Wazir A grief-stricken cop and an amputee grandmaster are brought together by a peculiar twist of fate as part of a wider conspiracy that has darkened their lives. N/A 7.2 N/A N/A Rajkumar Hirani Films Amitabh Bachchan, Farhan Akhtar, Aditi Rao Hydari, Manav Kaul 12,764 https://images-na.ssl-images-amazon.com/images/M/MV5BMTUzNDU4NDMyOV5BMl5BanBnXkFtZTgwNjcyNzU0NzE@._V1_SX300.jpg Bejoy Nambiar 8-Jan-16 103 min Action, Crime, Drama 1 nomination. 586028 0 Winter 1144900000 0.02 59 19405085 86 13300000 chess-grandmaster|chess|race-against-time|one-word-title|character-name-in-title
tt0323808 The Wicker Tree Charmed by the residents of Tressock, Scotland, two young missionaries accept the invitation to participate in a local festival, fully unaware of the consequences of their decision. R 3.9 N/A 24-Apr-12 Anchor Bay Entertianment Brittania Nicol, Henry Garrett, James Mapes, Lesley Mackie 2,155 https://images-na.ssl-images-amazon.com/images/M/MV5BMTkyNzkyODE5N15BMl5BanBnXkFtZTcwNjUxNzIxNw@@._V1_SX300.jpg Robin Hardy 27-Jan-12 96 min Drama, Horror N/A 7750000 0 Winter 1243900000 0.36 58 21446552 88 14100000 sex-scene|female-nudity|folk-horror|british-horror|supernatural-horror|three-word-title|plant-in-title|satire|black-comedy|second-part|sequel
tt0326965 In My Sleep Marcus is a popular massage therapist who struggles with parasomnia, a severe sleepwalking disorder that causes him to do things in his sleep that he cannot remember the next day. When he … PG-13 5.6 33 1-Oct-10 Morning Star Pictures Philip Winchester, Tim Draxl, Lacey Chabert, Abigail Spencer 1,741 https://images-na.ssl-images-amazon.com/images/M/MV5BNzg1MDM1NzIwMV5BMl5BanBnXkFtZTcwNzMxMTU1MQ@@._V1_SX300.jpg Allen Wolf 23-Apr-10 104 min Drama, Mystery, Thriller 6 wins. 1000000 57190 Spring 1626800000 0.30 62 26238710 93 17500000 knife|flashback|falling-down-stairs|policeman|cemetery|surprise-party|spa|swimming-pool|handcuffs|man-in-swimsuit|beefcake|bare-chested-male|vegetarian|vegan|independent-film
tt0327597 Coraline An adventurous girl finds another world that is a strangely idealized version of her frustrating home, but it has sinister secrets. PG 7.7 80 21-Jul-09 Focus Features Dakota Fanning, Teri Hatcher, Jennifer Saunders, Dawn French 159,786 https://images-na.ssl-images-amazon.com/images/M/MV5BMzQxNjM5NzkxNV5BMl5BanBnXkFtZTcwMzg5NDMwMg@@._V1_SX300.jpg Henry Selick 6-Feb-09 100 min Animation, Family, Fantasy Nominated for 1 Oscar. Another 7 wins & 43 nominations. 60000000 75286229 Winter 1227100000 0.17 62 19791935 64 19200000 parallel-worlds|stop-motion|scissors|new-home|eye|dream|secret-door|cat|rescue|talking-cat|spiderweb|seashell-bikini|spiral-staircase|scene-after-end-credits|puppet-animation|garment-button|lifting-someone-into-the-air|thunderstorm|monster|crying|shadow|loneliness|boat|rainbow|blood|orphan|woods|forename-as-title|one-word-title|bechdel-test-passed|husband-wife-relationship|little-girl|moving-crew|pet-cat|stuffed-animal|pet-dog|old-woman|thunder|letter|bicycling|little-boy|dowsing|forest|movers|search|clue|flashlight|fear|riddle|tears|fireplace|stage|anger|spotlight|beetle|big-top|child-protagonist|female-protagonist|search-for-parent|sleep|void|alternate-world|camera|walker|stabbing-a-doll|cane|hanging-from-a-flagpole|angel|snow|eccentric|pizza|lemonade|stuffed-toy-dog|moving|swinging-on-a-door|stuffed-animal-toy|reflection|motorcycle|other-father|tea|catalogue|gloves|clothing-store|toy-chest|texas|michigan|suitcase|rain|hummingbird|flowers|breaking-mirror|candy|disappearance-of-one’s-father|disappearance|running-away|dinner|theatre-audience|theatre-production|diving-into-a-barrel|song|singing|singer|shakespearean-quotation|reference-to-william-shakespeare|lightning|moving-van|mermaid|actress|cheese|balancing-on-a-balcony-railing|balcony|parachute|knitting-needle|mirror|computer|voice-over-narration|wallpaper|corset|cell-phone|dog|oregon|player-piano|pianist|eyeglasses|photograph|skipping-stone|milkshake|tent|pajamas|presidents’-da
y|lorgnette|hide-and-seek|cake|candle|bedroom|nightmare|sleeping|prayer|eating|food|buxom|boy|mother-daughter-relationship|father-daughter-relationship|sewing|thread|needle|horror-for-children|dark-fantasy|theatre|blue-hair|tunnel|trapeze|toy-train|tickling|tea-leaves|snowglobe|slug|praying-mantis|peeling-skin|mouse|high-dive|garden|fortune-telling|fog|dowsing-rod|cannon|bicycle|bat|poison-oak|mirror-as-portal|blowing-a-raspberry|well-shaft|rat|old-mansion|metamorphosis|mechanical-hand|insect|grandmother-grandson-relationship|eye-cut-out|cat-and-mouse|bug|acrobat|impostor|talking-animal|secret-passage|piano|old-dark-house|neighbor|kidnapping|key|ghost|ghost-child|game-playing|fantasy-world|doll|cotton-candy|circus|surrealism|3-dimensional|cult-film|stop-motion-animation|death-of-mother|based-on-novel|title-spoken-by-character|character-name-in-title|laptop-computer
tt0337584 Backseat A “coming of age” story where two old friends flee from New York City on a three-day road trip to Montreal, Canada to escape their problems. N/A 7.0 32 N/A Truly Indie Josh Alexander, Starla Benford, William Bogert, Robert T. Bogue 69 https://images-na.ssl-images-amazon.com/images/M/MV5BMTUwMDk1ODczMl5BMl5BanBnXkFtZTcwMzU2OTc1MQ@@._V1_SX300.jpg Bruce Van Dusen 28-Mar-08 80 min Comedy 1 win. 12343 0 Spring 1074900000 -0.20 55 19543636 111 9700000 road-trip|highway-travel|road-movie|on-the-road
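
The season matching was done with Excel formulas. As a rough illustration only, the same lookup could be sketched in R as shown here. The `assign_season` helper is hypothetical and not part of our workflow, and the lookup table holds only the six seasons from the BoxOfficeMojo sample rows, so any date outside that range returns NA.

```r
# Season lookup table, copied from the BoxOfficeMojo sample rows
seasons <- data.frame(
  Season.Start = as.Date(c("2007-01-05", "2007-03-02", "2007-05-04",
                           "2007-09-04", "2007-11-02", "2008-01-05")),
  Season.End   = as.Date(c("2007-03-01", "2007-05-03", "2007-09-03",
                           "2007-11-01", "2008-01-04", "2008-03-06")),
  Box.Office.Season = c("Winter", "Spring", "Summer", "Fall", "Holiday", "Winter"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the season whose date range contains each
# release date. Assumes seasons are sorted by Season.Start and do not overlap.
assign_season <- function(release_date, seasons) {
  idx <- findInterval(as.numeric(release_date), as.numeric(seasons$Season.Start))
  idx[which(idx == 0)] <- NA                        # before the first season
  too_late <- which(release_date > seasons$Season.End[idx])
  idx[too_late] <- NA                               # after the matched season ends
  seasons$Box.Office.Season[idx]
}

releases <- as.Date(c("2007-07-04", "2007-11-02", "2008-02-01", "2009-01-01"))
assign_season(releases, seasons)
```

Release dates in the source use a two-digit year (for example 7-Mar-08), so they would need to be parsed with care before a lookup like this; a 1921 release written as 19-Mar-21 is easily misread as 2021.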

 

Feature Evaluation

The table below indicates how all the original features were evaluated and modified.
Feature Outcome Evaluation
imdbid Eliminated Determined to not be meaningful for a multiple linear regression approach.
title Eliminated Determined to not be meaningful for a multiple linear regression approach.
plot Eliminated Determined to not be meaningful for a multiple linear regression approach.
rating Cleaned Category needed cleaning, so we collapsed it into a smaller number of like categories.
imdb_rating Left As-Is Used continuous variable without transformation
metacritic Left As-Is Used continuous variable without transformation
dvd_release Eliminated Determined to not be meaningful for a multiple linear regression approach.
production Split & Cleaned Kept only the first item in the comma-delimited list, then collapsed into fewer categories. We grouped together companies whose names were misspelled or differed only slightly.
actors Split Split the first and second items in a comma separated list as Lead1 and Lead2.
imdb_votes Left As-Is Used continuous variable without transformation
poster Eliminated Determined to not be meaningful for a multiple linear regression approach.
director Split Split the first item in a comma separated list as Director.
release_date Eliminated Determined to not be meaningful for a multiple linear regression approach.
runtime Left As-Is Used continuous variable without transformation
genre Split Split the first and second items in a comma separated list as genre1 and genre2.
awards Eliminated Determined to not be meaningful for a multiple linear regression approach.
keywords Eliminated Determined to not be meaningful for a multiple linear regression approach.
Budget Left As-Is Used continuous variable without transformation
box office gross Left As-Is Used continuous variable without transformation
Box Office Season Left As-Is Used categorical variable without transformation
Season Gross Left As-Is Used continuous variable without transformation
Season YoY Left As-Is Used continuous variable without transformation
Season Days Left As-Is Used continuous variable without transformation
Season Daily Avg Left As-Is Used continuous variable without transformation
Season Movie Count Left As-Is Used continuous variable without transformation
Season Move Avg Gross Left As-Is Used continuous variable without transformation
Year Left As-Is Used continuous variable without transformation

Observation Evaluation

Our client asked us to focus strictly on their live-action feature films released in the US over the last 5 years. Therefore, we removed observations with the following attributes:
- International movies: Movies from production companies outside the US were removed and were not factored into our exploratory data analysis.
- Genre filtering: Movies with a genre1 of ‘Animation’ or ‘Documentary’ were filtered out at the client’s request.
- Release date filtering: Movies released prior to July 2012 were filtered out at the client’s request.
- Runtime filtering: For this engagement, only movies with a runtime of 80 minutes or longer were treated as feature films. Therefore, per the client’s requirements, any film with a runtime of less than 80 minutes was eliminated.
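
The genre, release date, and runtime filters can be sketched in R as follows. This is a minimal illustration on a made-up four-row data frame, not our production code; the `movies` data and the `runtime_min` column are assumptions introduced for the example. The international filter is omitted because it depends on production company information that is not shown here.

```r
# Hypothetical sample; runtime is stored as text such as "111 min" in the source
movies <- data.frame(
  title        = c("A", "B", "C", "D"),
  genre1       = c("Drama", "Animation", "Horror", "Comedy"),
  release_date = as.Date(c("2013-05-01", "2014-06-01", "2011-01-15", "2015-03-20")),
  runtime      = c("111 min", "100 min", "95 min", "72 min"),
  stringsAsFactors = FALSE
)

# Strip the " min" suffix so runtime can be compared numerically
movies$runtime_min <- as.numeric(sub(" min$", "", movies$runtime))

keep <- !(movies$genre1 %in% c("Animation", "Documentary")) &  # genre filter
  movies$release_date >= as.Date("2012-07-01") &                # release date filter
  movies$runtime_min >= 80                                      # runtime filter

filtered <- movies[keep, ]
filtered$title
```

Building one logical vector and subsetting once keeps each rule visible on its own line, which makes it easier to audit the filters against the client’s requirements.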

Missing Data Remediation

Once we had filtered out the observations that were not to be used, we still had some missing values that needed to be dealt with. Here is how we handled the features with missing values:
- Budget & Box Office Gross: Any released movie whose budget and/or box office gross total could not be obtained was eliminated from the dataset. We did not feel these values could be reliably imputed due to the vast variance in these numbers.
- IMDB rating, IMDB votes, & Metacritic score: These values were fairly normally distributed, so we imputed missing values using the feature’s median value.
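
Median imputation of this kind can be sketched in R as follows. The `impute_median` helper and the small `scores` data frame are assumptions made for illustration; the values are arbitrary, and "N/A" mirrors how missing values appear in the source data.

```r
# Hypothetical helper: replace missing values with the column median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Illustrative values only; missing entries are stored as the text "N/A"
scores <- data.frame(
  imdb_rating = c("7.3", "N/A", "5.6", "7.7"),
  metacritic  = c("69", "33", "N/A", "80"),
  stringsAsFactors = FALSE
)

# Convert "N/A" to NA first so the columns can be treated as numeric
scores[] <- lapply(scores, function(col) as.numeric(replace(col, col == "N/A", NA)))
scores[] <- lapply(scores, impute_median)
scores
```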

Feature Engineering

At VARS Consulting, our domain knowledge allows us to create meaningful features from source data to enhance the model input data and provide the best possible results. We engineered several features based on proprietary criteria so that very high-cardinality categorical data such as Production Company, Director, and Actors could contribute meaningfully to the model. This is especially important because the past success of directors and actors can heavily influence future movie performance. Our methodology was as follows:
Box Office Performance Points
Each movie was ranked according to lifetime box office performance. We then identified movies ranked 1-10 as Top 10 movies, movies ranked 11-50 as Top 50 movies, and movies ranked 51-250 as Top 250 movies. Each Production, Director, and Actor in the dataset was then awarded 250 performance points for each movie in the Top 10, 50 performance points for each movie in the Top 50, and 10 performance points for each movie in the Top 250.
[Performance Points] = [ SUM(Top 10 Count) x 250 ] + [ SUM(Top 50 Count) x 50 ] + [ SUM(Top 250 Count) x 10 ]

An example for Directors is shown below. It lists the highest-ranking directors according to the formula and gives us a meaningful continuous variable to assist with modeling:
Director Sum.of.Top.10 Sum.of.Top.50 Sum.of.Top.250 Points
Christopher Nolan 2 1 2 570
Joss Whedon 2 0 0 500
Bill Condon 1 2 0 350
J.J. Abrams 1 0 3 280
Gareth Edwards 1 0 1 260
Andrew Stanton 1 0 0 250
Colin Trevorrow 1 0 0 250
James Cameron 1 0 0 250
Francis Lawrence 0 3 0 150
Jon Favreau 0 3 0 150
Michael Bay 0 2 2 120
David Yates 0 2 1 110
Zack Snyder 0 2 1 110
James Gunn 0 2 0 100
Clint Eastwood 0 1 2 70
Peter Jackson 0 1 2 70
Todd Phillips 0 1 2 70
Anthony Russo 0 1 1 60
Byron Howard 0 1 1 60
Chris Renaud 0 1 1 60
James Wan 0 1 1 60
Kyle Balda 0 1 1 60
Pierre Coffin 0 1 1 60
Steven Spielberg 0 1 1 60
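
The performance points formula can be checked against the first rows of the director table with a short R sketch. The function names `performance_points` and `rank_points` are introduced here for illustration and are not taken from our tooling.

```r
# Points formula: 250 per Top 10 movie, 50 per Top 50 movie, 10 per Top 250 movie
performance_points <- function(top10, top50, top250) {
  250 * top10 + 50 * top50 + 10 * top250
}

# Counts copied from the first four rows of the director table
directors <- data.frame(
  Director       = c("Christopher Nolan", "Joss Whedon", "Bill Condon", "J.J. Abrams"),
  Sum.of.Top.10  = c(2, 2, 1, 1),
  Sum.of.Top.50  = c(1, 0, 2, 0),
  Sum.of.Top.250 = c(2, 0, 0, 3),
  stringsAsFactors = FALSE
)
directors$Points <- with(directors,
  performance_points(Sum.of.Top.10, Sum.of.Top.50, Sum.of.Top.250))

# Hypothetical helper: points earned by a single movie given its lifetime rank
# (ranks 1-10, 11-50, 51-250, and anything lower)
rank_points <- function(rank) {
  tier <- cut(rank, breaks = c(0, 10, 50, 250, Inf),
              labels = c("250", "50", "10", "0"))
  as.numeric(as.character(tier))
}

directors
rank_points(c(1, 10, 11, 50, 51, 250, 251))
```

Summing `rank_points` over each director’s movies would give the same totals as counting movies per tier and applying the formula.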

Importing The Data

Finally, we brought our cleaned and prepared dataset into R to begin data analysis and model building.

Box.Office.Gross Season.Gross Season.YoY.Change Season.Days Season.Daily.Avg Season.Movie.Count Season.Movie.Avg Budget runtime imdb_rating imdb_votes metacritic Director.Perf.Pts Lead1.Perf.Points Lead2.Perf.Points Rating.Genre.Perf.Pts Box.Office.Season rating.group genre1
31537320 1397500000 -0.059 55 25409091 113 12400000 1e+06 83 5.6 50796 59 0 0 0 0 Spring R Drama
64473115 4851100000 0.127 122 39763115 232 20900000 3e+06 85 5.7 155490 41 0 0 0 0 Summer R Horror
35385560 1117700000 0.017 62 18027419 103 10900000 4e+06 91 4.6 30304 30 0 0 0 10 Winter R Mystery
2184640 1277200000 -0.161 58 22020690 133 9600000 5e+06 118 4.1 5359 42 10 0 0 60 Fall PG Adventure
65206105 1277200000 -0.161 58 22020690 133 9600000 5e+06 94 6.2 82263 55 20 0 0 10 Fall PG-13 Horror
50856010 1522300000 0.325 65 23420000 159 9600000 5e+06 89 4.4 37506 38 0 0 0 10 Fall PG-13 Horror

 

Exploring The Data

Exploratory data analysis (EDA) is an approach that uses various techniques to detect mistakes, check underlying assumptions, and roughly determine the relationships among the explanatory variables. Some EDA techniques are graphical in nature, whereas others are quantitative.

Depending on the type of data to be explored, exploratory data analysis falls into one of the following categories:

  • Univariate Non-graphical
  • Multivariate Non-graphical
  • Univariate Graphical
  • Multivariate Graphical

We reviewed box plots because they show robust measures of location and spread along with information about symmetry and outliers. Similarly, we reviewed histograms because they quickly depict the central tendency and modality of the data.
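
As a small illustration of these two plot types, the sketch below draws a histogram and a grouped box plot in base R. It uses only the six sample rows shown in the Importing The Data section, so the shapes are not representative of the full dataset; the `sample_rows` name is an assumption made for this example.

```r
# Six sample rows copied from the imported dataset preview
sample_rows <- data.frame(
  Box.Office.Gross = c(31537320, 64473115, 35385560, 2184640, 65206105, 50856010),
  Budget           = c(1e6, 3e6, 4e6, 5e6, 5e6, 5e6),
  rating.group     = c("R", "R", "R", "PG", "PG-13", "PG-13"),
  stringsAsFactors = FALSE
)

par(mfrow = c(1, 2))

# Histogram: central tendency and modality of box office gross
hist(sample_rows$Box.Office.Gross,
     main = "Box Office Gross", xlab = "Gross (USD)")

# Box plot: location, spread, and outliers of gross by MPAA rating group
boxplot(Box.Office.Gross ~ rating.group, data = sample_rows,
        main = "Gross by Rating", xlab = "Rating group", ylab = "Gross (USD)")

# Non-graphical counterpart: minimum, lower hinge, median, upper hinge, maximum
five_num <- fivenum(sample_rows$Box.Office.Gross)
five_num
```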

For our client, we selected certain numerical variables for exploratory data analysis. We studied the runtime of the movies, the number of IMDB votes for movies released in the last 5 years, and the budget of each movie graphically. Here are some of our findings:

     

In addition, both the season in which a movie is released and the rating assigned by the MPAA can have a significant impact on box office performance, as seen in these graphs:

         

The combination of rating and season also creates significant variation in performance:

 

The EDA showed that the data was skewed for certain variables and needed smoothing. Source data cannot always be relied on as provided; therefore, we made some modifications and cleaned the data to make it suitable for analysis.


 

Multiple Linear Regression - Initial Assessment

First, we simply built a model that uses all of the data to evaluate the potential of a linear model for the client:

bo <- lm(Box.Office.Gross ~ .
         ,data = all_vars)
summary(bo)
## 
## Call:
## lm(formula = Box.Office.Gross ~ ., data = all_vars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -60239091 -16845344         0  18832919  64086533 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)   
## (Intercept)              -2.304e+08  3.563e+08  -0.647  0.52229   
## Season.Gross             -1.151e-01  8.783e-02  -1.310  0.19926   
## Season.YoY.Change        -1.007e+06  7.604e+07  -0.013  0.98951   
## Season.Days               1.844e+06  4.938e+06   0.374  0.71114   
## Season.Daily.Avg          1.099e+00  9.605e+00   0.114  0.90961   
## Season.Movie.Count        1.049e+06  2.078e+06   0.505  0.61713   
## Season.Movie.Avg          6.840e+00  2.059e+01   0.332  0.74190   
## Budget                    4.764e-01  2.618e-01   1.820  0.07791 . 
## runtime                   5.359e+05  6.344e+05   0.845  0.40434   
## imdb_rating               5.441e+06  1.236e+07   0.440  0.66266   
## imdb_votes                6.128e+01  8.942e+01   0.685  0.49795   
## metacritic                3.425e+05  6.344e+05   0.540  0.59290   
## Director.Perf.Pts         1.236e+06  7.323e+05   1.688  0.10085   
## Lead1.Perf.Points         8.152e+05  4.478e+05   1.821  0.07776 . 
## Lead2.Perf.Points        -6.930e+04  4.283e+05  -0.162  0.87246   
## Rating.Genre.Perf.Pts    -3.369e+04  9.975e+03  -3.378  0.00189 **
## Box.Office.SeasonHoliday  7.215e+07  1.483e+08   0.487  0.62980   
## Box.Office.SeasonSpring   7.077e+07  3.318e+07   2.133  0.04045 * 
## Box.Office.SeasonSummer   9.416e+07  2.290e+08   0.411  0.68365   
## Box.Office.SeasonWinter   4.336e+07  5.122e+07   0.846  0.40340   
## rating.groupPG-13        -5.887e+07  7.615e+07  -0.773  0.44496   
## rating.groupR            -5.837e+07  7.395e+07  -0.789  0.43554   
## genre1Adventure          -6.989e+07  4.662e+07  -1.499  0.14331   
## genre1Biography          -4.037e+07  3.266e+07  -1.236  0.22528   
## genre1Comedy             -7.419e+06  1.961e+07  -0.378  0.70758   
## genre1Crime              -4.551e+07  2.931e+07  -1.553  0.13004   
## genre1Drama              -1.239e+07  2.790e+07  -0.444  0.65983   
## genre1Horror              3.977e+07  2.717e+07   1.464  0.15272   
## genre1Mystery             5.218e+06  4.768e+07   0.109  0.91351   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38920000 on 33 degrees of freedom
## Multiple R-squared:  0.9137, Adjusted R-squared:  0.8404 
## F-statistic: 12.47 on 28 and 33 DF,  p-value: 8.405e-11
par(mfrow = c(2, 2))
plot(bo)

Multiple Linear Regression - Best Subset

The initial look suggests that we have a data structure conducive to linear regression, but we need to find the simplest combination of independent variables that still produces acceptable results. To do this, we employ a best subset approach:

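# BEGIN best subset approach
library(leaps)  # provides regsubsets()

# Keep up to the 10 best models of each size, for sizes up to 30 terms
best.subset <- regsubsets(Box.Office.Gross ~ ., data = all_vars,
                          nvmax = 30, nbest = 10, really.big = TRUE)
best.subset.summary <- summary(best.subset)

# Show plots evaluating the results of best subset approach.
# With nbest = 10 the summary holds one entry per candidate model,
# so the x-axis of every plot is the model index, not the model size.
par(mfrow = c(2, 2))
plot(best.subset.summary$rss, xlab = "Model Index Number", ylab = "RSS", type = "l")
plot(best.subset.summary$adjr2, xlab = "Model Index Number", ylab = "Adjusted RSq", type = "l")
plot(best.subset.summary$cp, xlab = "Model Index Number", ylab = "CP", type = "l")
plot(best.subset.summary$bic, xlab = "Model Index Number", ylab = "BIC", type = "l")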

In addition to the R-squared, adjusted R-squared, Cp, and BIC values, we need to understand the VIF and Durbin-Watson values of each prospective model to make the best possible selection. VARS Consulting adds these values to the best subset data frame as follows:

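library(car)  # provides vif() and durbinWatsonTest()

bt <- best.subset.summary$which
best.subset.tests <- data.frame(vif = double(), dwtval = double(), dwpval = double())
for (i in seq_len(nrow(bt))) {
  # Drop the predictors that candidate model i excludes.
  # Note: only columns 1-13 of bt are checked, and dummy-variable names such as
  # Box.Office.SeasonSpring do not match any column of all_vars, so predictors
  # beyond column 13 and all factor variables are kept in every refit.
  loop_df <- all_vars
  for (j in 1:13) {
    if (!bt[i, j]) {
      loop_df <- loop_df[, !names(loop_df) %in% colnames(bt)[j]]
    }
  }
  # -999 marks models whose diagnostics could not be computed
  vif_val <- -999
  dwt_val <- -999
  dwp_val <- -999
  tryCatch({
    # Fit once and reuse the fit for both diagnostics
    fit <- lm(Box.Office.Gross ~ ., data = loop_df)
    # With factor predictors, vif() returns a matrix; column 3 is GVIF^(1/(2*Df))
    vif_val <- max(vif(fit)[, 3])
    dw <- durbinWatsonTest(fit)
    dwt_val <- dw$dw
    dwp_val <- dw$p
  }, error = function(e) {})
  best.subset.tests[i, 1] <- vif_val
  best.subset.tests[i, 2] <- dwt_val
  best.subset.tests[i, 3] <- dwp_val
}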


all_results <- data.frame(best.subset.summary$rsq,best.subset.summary$adjr2,best.subset.summary$cp,best.subset.summary$bic,best.subset.tests)
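
For readers less familiar with the two diagnostics, the following minimal sketch computes them by hand in base R. It uses the built-in mtcars data and an arbitrary three-predictor model, both assumptions for illustration only; it is not the code used in this engagement, which relies on vif and durbinWatsonTest.

```r
# Manual VIF and Durbin-Watson statistic on the built-in mtcars data (illustrative).
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# VIF for each predictor: regress it on the other predictors, then 1 / (1 - R^2).
predictors <- c("wt", "hp", "disp")
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

# Durbin-Watson statistic: sum of squared successive residual differences
# divided by the residual sum of squares. Values near 2 suggest little
# first-order autocorrelation.
e <- residuals(fit)
dw_manual <- sum(diff(e)^2) / sum(e^2)

print(vif_manual)
print(dw_manual)
```

The Durbin-Watson statistic depends on the order of the rows, so it is most informative when the observations have a natural ordering such as release date.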

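Finally, we filter the data frame to find the prospective models that meet all of our criteria thresholds: a maximum VIF below 5, R-squared and adjusted R-squared above 0.7, and a Durbin-Watson statistic between 1.95 and 2.05.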

# Filter data frame to find models that meet all criteria
all_results[all_results$vif < 5 & all_results$vif > 0 &
              all_results$best.subset.summary.rsq > .7 &
              all_results$best.subset.summary.adjr2 > 0.7 &
              all_results$dwtval > 1.95 & all_results$dwtval < 2.05,]
##    best.subset.summary.rsq best.subset.summary.adjr2
## 82               0.8777332                 0.8565717
## 89               0.8739507                 0.8521345
##    best.subset.summary.cp best.subset.summary.bic      vif   dwtval dwpval
## 82               4.737516               -89.02476 4.928035 2.029804  0.828
## 89               6.183427               -87.13574 4.610032 2.032820  0.836
# Show variables of selected model
best.subset.summary$which[82,]
##              (Intercept)             Season.Gross        Season.YoY.Change 
##                     TRUE                    FALSE                    FALSE 
##              Season.Days         Season.Daily.Avg       Season.Movie.Count 
##                    FALSE                     TRUE                    FALSE 
##         Season.Movie.Avg                   Budget                  runtime 
##                    FALSE                     TRUE                    FALSE 
##              imdb_rating               imdb_votes               metacritic 
##                    FALSE                    FALSE                    FALSE 
##        Director.Perf.Pts        Lead1.Perf.Points        Lead2.Perf.Points 
##                     TRUE                     TRUE                    FALSE 
##    Rating.Genre.Perf.Pts Box.Office.SeasonHoliday  Box.Office.SeasonSpring 
##                     TRUE                     TRUE                     TRUE 
##  Box.Office.SeasonSummer  Box.Office.SeasonWinter        rating.groupPG-13 
##                     TRUE                    FALSE                    FALSE 
##            rating.groupR          genre1Adventure          genre1Biography 
##                    FALSE                    FALSE                    FALSE 
##             genre1Comedy              genre1Crime              genre1Drama 
##                    FALSE                    FALSE                    FALSE 
##             genre1Horror            genre1Mystery 
##                     TRUE                    FALSE

Multiple Regression - Final Model Selection

Two models meet all criteria. We select model 82, which has the higher adjusted R-squared and the lower Cp and BIC of the two. Based on our proprietary best subset approach, the simplest model that performs best includes:
- Season Daily Average
- Budget
- Director Performance Points
- Lead Actor 1 Performance Points
- Rating Genre Performance Points
- Box Office Season
- Genre 1

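Because the selected subset includes individual levels of Box Office Season and Genre 1, the final model includes both factors in full. This is good confirmation that VARS Consulting's feature engineering efforts were of significant value to the final model.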

bo4 <- lm(Box.Office.Gross ~
            Season.Daily.Avg
            + Budget
            + Director.Perf.Pts
            + Lead1.Perf.Points
            + Rating.Genre.Perf.Pts
            + Box.Office.Season
            + genre1
            ,data = all_vars)

summary(bo4)
## 
## Call:
## lm(formula = Box.Office.Gross ~ Season.Daily.Avg + Budget + Director.Perf.Pts + 
##     Lead1.Perf.Points + Rating.Genre.Perf.Pts + Box.Office.Season + 
##     genre1, data = all_vars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -92277096 -15242764  -4034985  18327847  73913164 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.552e+08  5.301e+07   2.927 0.005346 ** 
## Season.Daily.Avg         -6.046e+00  2.314e+00  -2.613 0.012154 *  
## Budget                    4.714e-01  1.678e-01   2.810 0.007309 ** 
## Director.Perf.Pts         1.028e+06  5.261e+05   1.954 0.056929 .  
## Lead1.Perf.Points         1.033e+06  3.784e+05   2.729 0.009028 ** 
## Rating.Genre.Perf.Pts    -3.191e+04  8.196e+03  -3.894 0.000324 ***
## Box.Office.SeasonHoliday  1.501e+08  5.615e+07   2.672 0.010450 *  
## Box.Office.SeasonSpring   8.214e+07  2.451e+07   3.351 0.001637 ** 
## Box.Office.SeasonSummer   1.202e+08  3.465e+07   3.469 0.001164 ** 
## Box.Office.SeasonWinter   7.917e+06  1.809e+07   0.438 0.663766    
## genre1Adventure          -4.951e+07  2.932e+07  -1.688 0.098248 .  
## genre1Biography          -1.046e+07  2.123e+07  -0.493 0.624710    
## genre1Comedy             -2.789e+06  1.589e+07  -0.176 0.861431    
## genre1Crime              -3.805e+07  2.233e+07  -1.704 0.095233 .  
## genre1Drama              -2.419e+07  2.260e+07  -1.070 0.290135    
## genre1Horror              3.516e+07  2.132e+07   1.649 0.106040    
## genre1Mystery            -2.027e+07  4.014e+07  -0.505 0.616040    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36940000 on 45 degrees of freedom
## Multiple R-squared:  0.8939, Adjusted R-squared:  0.8562 
## F-statistic:  23.7 on 16 and 45 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(bo4)

vif(bo4)
##                            GVIF Df GVIF^(1/(2*Df))
## Season.Daily.Avg      22.273033  1        4.719431
## Budget                 3.974359  1        1.993579
## Director.Perf.Pts     13.203012  1        3.633595
## Lead1.Perf.Points     14.529188  1        3.811717
## Rating.Genre.Perf.Pts  2.649588  1        1.627756
## Box.Office.Season     38.563123  4        1.578598
## genre1                 4.034130  7        1.104760
durbinWatsonTest(bo4)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.01069427       2.00018   0.668
##  Alternative hypothesis: rho != 0
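
When a model contains factors, vif reports a generalized VIF (GVIF) together with an adjusted value, GVIF^(1/(2*Df)), that makes terms with different degrees of freedom comparable. The short check below reproduces the third column from the first two, using values copied from the output above. For a term with one degree of freedom the adjusted value is the square root of the ordinary VIF, so a cutoff applied to this column corresponds to the square of that cutoff on the VIF scale.

```r
# GVIF and Df values copied from the vif(bo4) output above.
gvif <- c(Season.Daily.Avg = 22.273033, Box.Office.Season = 38.563123, genre1 = 4.034130)
df <- c(Season.Daily.Avg = 1, Box.Office.Season = 4, genre1 = 7)

# Adjusted value reported in the third column: GVIF^(1 / (2 * Df)).
adjusted <- gvif^(1 / (2 * df))
print(round(adjusted, 6))

# Squaring the adjusted value puts it back on the familiar VIF scale.
print(round(adjusted^2, 3))
```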

Finally, we plot the resulting model to evaluate how close our predictions are to actual data:

par(mfrow = c(1, 1))
plot(predict(bo4),all_vars$Box.Office.Gross,
     xlab="Predicted Box Office Gross $",ylab="Actual Box Office Gross $")
abline(a=0,b=1)
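
A predicted-versus-actual plot can be complemented with numeric error summaries. The sketch below shows one way to compute them for any fitted lm object; it uses the built-in mtcars data and an arbitrary model as stand-ins, so the numbers it prints say nothing about bo4.

```r
# In-sample error summaries for a fitted lm, shown on mtcars (illustrative data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
predicted <- predict(fit)
actual <- mtcars$mpg

rmse <- sqrt(mean((actual - predicted)^2))  # root mean squared error
mae <- mean(abs(actual - predicted))        # mean absolute error
r2 <- cor(predicted, actual)^2              # matches Multiple R-squared for an lm with an intercept

print(c(rmse = rmse, mae = mae, r2 = r2))
```

These are in-sample figures; errors on releases that were not used to fit the model would typically be larger.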


 

Predicting Future Performance

Using the selected model, we predict the box office gross for the client's upcoming theatrical releases. The input data for these releases are:

Box.Office.Gross Season.Gross Season.YoY.Change Season.Days Season.Daily.Avg Season.Movie.Count Season.Movie.Avg Budget runtime imdb_rating imdb_votes metacritic Director.Perf.Pts Lead1.Perf.Points Lead2.Perf.Points Rating.Genre.Perf.Pts Box.Office.Season rating.group genre1
0 NA -0.006 58 19.6 96 11.9 0 105 6.3 14496 53 0 0 0 2570 Winter PG-13 Action
0 NA -0.152 122 31.0 245 15.4 0 105 6.3 14496 53 0 0 0 60 Summer R Horror
0 NA 0.103 58 20.8 157 7.7 0 105 6.3 14496 53 0 0 0 40 Fall PG-13 Comedy
0 NA -0.672 59 16.0 64 14.8 0 105 6.3 14496 53 0 10 0 40 Holiday R Adventure

Finally, we predict future box office revenue using our model:

predict(bo4,feat)
##         1         2         3         4 
##  81059502 308588176 151097635 264754577
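
To show how a prediction is assembled, the sketch below recomputes the first prediction by hand from the rounded coefficients printed in summary(bo4) and the first row of the input data. Fall and Action do not appear among the coefficients, so they are taken to be the reference levels and contribute no term. Because the printed coefficients are rounded, the result (roughly 81.1 million) is close to, but not exactly, the value returned by predict.

```r
# Rounded coefficients copied from the summary(bo4) output above.
coefs <- c(
  intercept = 1.552e+08,
  Season.Daily.Avg = -6.046e+00,
  Budget = 4.714e-01,
  Director.Perf.Pts = 1.028e+06,
  Lead1.Perf.Points = 1.033e+06,
  Rating.Genre.Perf.Pts = -3.191e+04,
  Box.Office.SeasonWinter = 7.917e+06
)

# First release: Winter, Action (reference genre, so no genre term).
x <- c(
  intercept = 1,
  Season.Daily.Avg = 19.6,
  Budget = 0,
  Director.Perf.Pts = 0,
  Lead1.Perf.Points = 0,
  Rating.Genre.Perf.Pts = 2570,
  Box.Office.SeasonWinter = 1
)

# A linear model prediction is the sum of coefficient times predictor value.
manual_prediction <- sum(coefs * x)
print(manual_prediction)
```

With Budget, Director.Perf.Pts, and Lead1.Perf.Points all set to 0 for this release, the prediction is driven by the intercept, the Winter season effect, and Rating.Genre.Perf.Pts.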

 

 


Conclusion

Interpretation:

- A 105-minute action movie released in the winter season is expected to gross $81,059,502 at the box office.
- A 105-minute horror movie released in the summer season is expected to gross $308,588,176 at the box office.
- A 105-minute comedy movie released in the fall season is expected to gross $151,097,635 at the box office.
- A 105-minute adventure movie released in the holiday season is expected to gross $264,754,577 at the box office.